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Abstract. The machine learning community has recently devoted much 
attention to the problem of inferring causal relationships from statistical 
data. Most of this work has focused on uncovering connections among 
scalar random variables. We generalize existing methods to apply to col- 
lections of multi-dimensional random vectors, focusing on techniques ap- 
plicable to linear models. The performance of the resulting algorithms 
is evaluated and compared in simulations, which show that our methods 
can, in many cases, provide useful information on causal relationships 
even for relatively small sample sizes. 

1 Introduction 

Many techniques have recently been developed for inferring causal relationships 
from data over a set of random variables [1I2I3I4I5I6I7I8I9] . While most of this 
work has focused on uncovering connections among scalar random variables, 
in many actual cases each of the variables of interest may consist of multiple 
related, but distinct, measurements. For instance, in fMRI data analysis one 
is often interested in the functional connectivity among brain regions, and for 
each such region-of-interest one has data measured from a set of multiple voxels. 
Typically, in these cases, some aggregate of each area is computed, after which 
the standard approaches are directly applicable. However, it can be shown that 
not only may information be lost when computing aggregates, but the outputs 
of such methods may not even be correct in the large sample limit. 

A simple example illustrating one of the problems inherent with working 
with aggregates is the following. Consider three sets of variables with causal 
connections X — > V — > Z, i.e. the variables in X may influence the variables in 
y, but not directly the variables in Z, and the variables in y may influence the 
variables in Z. In this case, each variable x £ X is independent of each z £ Z 
conditional on the full set of mediating variables y . However, when replacing the 
variables of each group with their respective mean value (the typical aggregate 
used), denoted by x, y, and z, in general we obtain x A/l z \ y |1I10| . Thus, 
it is important to develop methods for causal discovery that exploit the full 
information available, as opposed to only aggregates of the data. 

Towards this end, in this paper we extend two existing approaches |5I6) de- 
signed for causal discovery among scalar random variables to the case of random 
vectors (i.e. groups of variables), both exploiting any kind of non-Gaussianity 
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present in the data. We also extend a recent method [8] for inferring the causal 
relationship among two arbitrarily distributed multi-dimensional variables to an 
arbitrary number of such variables. After describing the resulting algorithms, we 
evaluate and compare their performance in numerical simulations. 

2 Model and Problem Statement 

Throughout the paper, we will use the term 'group' to denote a set of un- 
derlying variables all belonging to the same (multi-dimensional) random vector 
representing a single concept (e.g. one region in fMRI analysis). We use the term 
'variable' to represent a single scalar random variable belonging to one of the 
groups. Thus, for g = 1, . . . , G, let X s denote group g, and let the random vector 
x g = (x[ 9 \ . . . , Xn]) T collect the n g random variables belonging to group g. We 
assume that the groups X g can be arranged in a causal order K — (fci, . . . , kc), 
such that the causal relationships among the groups can be represented by a 
directed acyclic graph. The data generating process is assumed to be a set of 
linear equations, given by 



with Bfc 4) £. arbitrary (real) matrices of dimension x n^, containing the 
direct effects from group Xk j to group X^. . The vectors of disturbance terms 
efc i are assumed to be zero mean, and mutually independent over groups, i.e. 
efc i _LL efc. , i ^ j, but are allowed to be dependent within each group. If we 
arrange the groups in a causal order K and define x = [x^, . . . ,Xk G ) and e = 
(ejtj, . . . , ej, G ), we can rewrite Equation ([I]) in matrix form as x — Bs+e with B 
a lower block triangular matrix. The model reduces to standard LiNGAM (Linear 
Non-Gaussian Acyclic Model, |4l5j ) when Vg : n g — 1 and all disturbances e are 
non-Gaussian. It also includes the model of [6] when G = 2, n\ = n,2 = 1 and the 
disturbances are non-Gaussian. Finally, it contains as a special case the noisy 
model of [5] when G = 2 (but with no restriction on the n g and e). 

We assume that all variables in x are observed, and that the grouping of these 
variables is known. Given merely observations of x generated by Model ([T]) (i.e. 
B and e are unknown) , we want to estimate the unknown causal order K among 
these groups. We denote the data matrix of observations over the variables x as 
X = (Xi, . . . , Xg) T , where each column corresponds to one observation and each 
row to one variable. The observations are grouped according to the G groups, 
arranged in a random order, such that the first n\ rows correspond to group Xi, 
the following rii rows to group X2, and so on. 

We note that our model family is equivalent to that given by [?]■ The main 
difference between our approach and theirs is that they do not assume to know 
which variable belongs to what group, which results in algorithms exponential in 
the number of involved variables, whereas our algorithm explicitly builds upon 
such knowledge, allowing to construct computationally and statistically more 
efficient algorithms, polynomial in the number of groups. 
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3 Method 

The overall algorithm for finding a causal order among the groups follows the 
approach introduced in [5]. We first search for an exogenous group (Section 
and then 'regress out' the effect of this group on all other groups (Section 
We iterate this process to generate a full causal order over the G groups. 



3.1), 



3.2). 



3.1 Finding an Exogenous Group 

We generalize three existing methods to search for an exogenous group, formally 
defined as below. Note that since the connections among the groups are assumed 
to be acyclic, there always exists at least one such exogenous ('source') group. 

Definition 1 A group Xj is exogenous if for Xj all matrices Bj.i of Equa- 
tion (JTJ) are zero. 

GroupDirectLiNGAM As our first approach, we generalize the idea of Di- 
rectLiNGAM !5] to find an exogenous variable, to finding an exogenous group. 
The following lemma, which corresponds directly to Lemma 1 in [5], states a 
criterion to find an exogenous group using regressions and independence tests. 

Lemma 1. Let x follow Model (fll) with non- Gaussian disturbance terms e. Let 
(i) 

r/ := Xi — Cxj be the residuals when regressing Xi on Xj using ordinary least 
squares (OLS). A group Xj is exogenous if and only if Xj _LL for all i =/= j. 

The proof of this lemma, and the proof of Lemma [2] in Section [372] are left to the 
online appendix at |http: / / www.cs~h elsinki.fi /u/ entner/GroupCausalOr der /| 

To apply Lemma [TJ in practice, we need to test for (in) dependence between 
two vectors of random variables, and combine the results of several such in- 
dependence tests. Assuming that the test returns p-values pji under the null 

hypothesis of Xj _LL r[ , i ^ j, we can get a measure of how exogenous group 
Xj is by combining these p-values using Fisher's method ~ Tj. This means that 
we select as the exogenous group the one minimizing 

^— '■>■¥= 3 

To obtain the p-values pji we can test for joint dependence of the two vec- 
tors Xj and r\ using the Hilbert Schmidt Independence Criterion (HSIC, [T~]). 
which, however, requires many samples to detect dependencies for high dimen- 
sional vectors. Alternatively, we can perform pairwise tests of each variable in 
Xj against each variable in using nonlinear correlations, and combine the 
resulting rij x rii p-values appropriately. Details are left to the online appendix. 



Pairwise Measure Our second approach is based on modifying the pairwise 
measure ~ ] designed for inferring the causal relationship between two linearly 
related non-Gaussian scalar random variables x and y. If the true underlying 
causal direction is from x to y, the model is defined as y = px + e y with x _LL e y . 
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As pointed out in Section 2, this is just a special case of our more general model. 
The (normalized) ratio of the log likelihoods for the two possible causal models is 
given by R(x, y) = (log L(x — > y) — log L(y — > x)) /m, where m is the sample size 
and L the likelihood of the specified direction, under some suitable assumption 
on the distributions of the disturbances. If the true underlying causal direction 
is x — > y, then R(x,y) > in the large sample limit. Symmetrically, if x <— y, 
then R(x,y) < in the limit. 

To use the ratio R(- , •) to find an exogenous group X.,-, the naive approach 
is to calculate R(x k ,Xj ) for each pair with x k £ Xj, k = l,...,rij, sCj € 
Xi, I = l,...,m,i ^ j, and combine these measures. However, even if Xj is 
exogenous, these pairs do not necessarily meet the model assumption because of 
the dependent error terms within each group, and hence there is no guarantee 
for correctness even in the large sample limit. This approach, termed the Naive 
Pairwise Measure, may however have a statistical advantage for small sample 
sizes (see Section [4]). 

To obtain a consistent method (simply termed Pairwise Measure in Sec- 
tion [Ij), we replace the second variable of the pairs (x k \x^) with a quantity 
which guarantees that the model assumption is met if Xj is exogenous: We 
first estimate the regression model xf' = Y^t-i hk x ^ + r l,(i)- ^ ^ s exoge- 
nous then the regression coefficients b lk - are consistent estimators of the true 
total effects (when marginalizing out any intermediate groups). Hence, defining 



45 : = ^ -T,lU.- k ^ k hk x { = hk x k+ri,{i) yields a pair (x k j) , z£{) meeting the 



model assumption of [6] if Xj is exogenous. Thus, in this case, R(x k , zi \) > 0, 
in the limit, for all fc, I, and i ^= j. On the contrary, if Xj is not exogenous the 
measure can take either sign, and simulations show that it is unlikely to always 
obtain a positive one. A way to combine the ratios is suggested in [5] , which can 
be modified for the group case as 



That is, we penalize each negative value according to its squared magnitude and 
adjust for the group sizes. We select the group minimizing this measure as the 
exogenous one. 

Trace Method Our third method for finding an exogenous group is based on 
the approach of [819] . termed the Trace Method, designed to infer the causal order 
among two groups of variables X and y with n x and n y variables, respectively. 
If the underlying true causality is given by X — > V, the model is defined as y = 
Hx + e, where the connection matrix B is chosen independently of the covariance 
matrix of the regressors £ := cov(x, x), and the disturbances e are independent 
of x. Note that this method is based purely on second-order statistics and does 
not make any assumptions about the distribution of the error terms e, as opposed 
to the previous two approaches where we needed non-Gaussianity. The measure 
to infer the causal direction defined in [5] is given by 
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A x ^y := log (tr(BSB T )/ % ) - log (tr{t)/n x ^ - log (V(BB T )/n a ) (4) 

where tr(-) denotes the trace of a matrix, S an estimate of the covariance matrix 
of x, and B the OLS estimate of the connection matrix from x to y. The measure 
for the backward direction Ay^x is calculated similarly by exchanging B with 
the OLS estimate of the connection matrix from y to x and X with the estimated 
covariance matrix of y. If the correct direction is given by X —¥ V, Janzing et 
al. [5] (i) conclude that Ax^,y ~ 0, (ii) show for the special case of B being an 
orthogonal matrix and the covariance matrix of e being AI, that Ay^x < 0, 
and (iii) show for the noise free case that Ay^x > 0. Hence, the underlying 
direction is inferred to be the one yielding A closer to zero [8]. In particular, if 
|Z\x->y| / |^iy->-x| < Ij then the direction is judged to be X — s- y. 

We suggest using the Trace Method to find an exogenous group Xj among G 
groups in the following way. For each j, we calculate the measures Ax^x t and 
Axi^Xj, for all i ^ j, and infer as exogenous group the one minimizing 

V U) = (Ax^xJ Ax^x,) 2 ■ (5) 

3.2 Estimating a Causal Order 

Following the approach of [5], after finding an exogenous group we 'regress out' 
the effect of this group on all other groups. Since the resulting data set follows 
again the model in Equation ([I]) having the same causal order as the original 
groups, we can search for the next group in the causal order in this reduced data 
set. This is formally stated in the following lemma, which corresponds to the 
combination of Lemma 2 and Corollary 1 in [5] . 

Lemma 2. Let x follow Model (fTl), and assume that the group Xj is exogenous. 

(i) 

Let r 2 := Xi — Cxj be the residuals when regressing cc, on Xj using OLS, 
for i = 1, . . . , G, i ^ j, and denote by r^' the column vector concatenating all 
these residuals. Then r^' = B^r^ + follows Model ([I]). Furthermore, the 
residuals in follow the same causal order as the original groups Xi, i =/= j . 

Using Lemma [2j and the methods of Section |3.1| we can formalize the approach 
to find a causal order among the groups as shown in Algorithm [T] 

3.3 Handling Large Variable Sets with Few Observations 

The OLS estimation used in Algorithm [T] requires an estimate of the inverse 
covariance matrix which can lead to unreliable results in the case of low sample 
size. One approach to solving this problem is to use regularization. For the L 2 - 
regularized estimate of the connection matrix we obtain C itj = X.Xj (XjX^ + 
AI) -1 = cov(Xj,Xj) m (mcov(Xj,Xy) + AI) -1 , with m the sample size and A 
the regularization parameter, see for example |13j . In particular, this provides a 
regularized estimate of the covariance matrix. 



6 



Doris Entner, Patrik O. Hoyer 



Algorithm 1 (Estimating a Causal Order among Groups) 

Input: Data matrix X generated by Model Q, arranged in a random causal order 
Initialize the causal order K := []. 
repeat 

Find an exogenous group Xj from X using one of the approaches in Section [3. 1| 
Append j to K. 

Replace the data matrix X with the matrix R^' concatenating all residuals 
R$ , i 7^ j, from the regressions of cc,: on Xj using OLS: 

Xj = C X + Rp with d,j = cov(X l , X 3 -) cov(Xj, X 3 -) _1 . 

until G — 1 group indices are appended to K 
Append the remaining group index to K. 



Another approach is to apply the methods of Section |3.1| for finding an ex- 
ogenous group to N data sets, each of which consists of G groups formed by 
taking subsets of the variables of the corresponding original groups. We then 
calculate measures j = 1, . . . , G, n = 1, . . . , N, as in Equations ^ or 
|5|, for each such data set separately, and pick the group Xj* which minimizes 
the sum over these sets to be an exogenous one, i.e. 

j* =argmin V u$ (6) 

J — 'l<n<JV 

where fi„ is the measure of group j in the n th data set. We then can proceed 
as in Algorithm [l] to find the whole causal order. 

Note that the same approach can be used when multiple data sets are avail- 
able, which are assumed to have the same causal order among the groups but 
possibly different parameter values. An example for such a scenario is given by 
fMRI data from several individuals. An equivalent of Equation ^ was suggested 
in [T3] for the single variable case with multiple data sets. 

4 Simulations 

Together, the methods of Section 3 provide a diverse toolbox for inferring the 
model of Section 2. Here, we provide simulations to evaluate the performance of 
the variants of Algorithm 1, and compare it to a few ad hoc methods. Matlab 
code is available at http://www.es. helsinki.fi/u/entner/GroupCausalOrder/ 

We generate models following Equation (JlJ by randomly creating the con- 
nection matrices B^t. , i > j with, on average, s% of the entries being nonzero 
and additionally ensure that at least one entry is nonzero, to ensure a complete 
graph over the groups. To obtain the disturbance terms e/ Ci for each group, we 
linearly mix random samples from various independent non-Gaussian variables 
as to obtain dependent error terms within each group. Finally, we generate the 
sample matrix X and randomly block-permute the rows (groups) to hide the 
generating causal order from the inference algorithms. 
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(a) 100 models with 5 groups (b) 50 models with 3 groups 

Fig. 1. Sample size (x-axis) against error rate (y-axis) for various model sizes and 
algorithms, as indicated in the legends (abbreviations: GDL = GroupDirectLiNGAM; 
nlcorr, HSIC: nonlinear correlation or HSIC as independence test; TrMeth. = Trace 
Method; PwMeas. = Pairwise Measure; ICA-L = modified ICA-LiNGAM approach; 
DL = DirectLiNGAM on the mean- variables; lOsets = Equation |6| on N = 10 data 
sets; L2reg = L 2 -regularization for covariance matrix). The dashed black line indicates 
the number of mistakes made when randomly guessing an order. 



We compare the variants of Algorithm [T] to two ad hoc methods. The first 
one is a modified ICA-based LiNGAM approach |3] where instead of searching 
for a permutation yielding a lower triangular connection matrix B (i.e. finding a 
causal order among the variables), we search for a block permutation yielding a 
lower block triangular matrix B (i.e. finding a causal order among the groups). 
Secondly, we compare our approach to DirectLiNGAM [5], when replacing each 
group by the mean of all its variables)]] 

We measure the performance of the methods by computing the error rates 
for predicting whether X^ is prior to Xj, for all pairs (i, j), i < j. 

Results for simulated data of sample size 200, 500 and 1000 generated from 
100 random models having 5 groups with either 6 or 12 variables each, and 
s = 10%, are shown in Figure [I] (a). As expected, most methods based on 
Algorithm [l] improve their performance with increasing sample size. The only 
exception is the Trace Method on the smaller models; to be fair the method was 
not really designed for so few dimensions. Overall, the best performing method 
is the Pairwise Measure, closely followed by GroupDirectLiNGAM for the larger 
sample sizes. The ad hoc methods using DirectLiNGAM on the mean perform 
about as well as guessing an order (indicated by the dashed black line), whereas 
the modified ICA-LiNGAM approach performs better than guessing. However, it 
does not seem to converge for growing sample size, probably due to the dependent 
errors within each group, which is a violation of the ICA model assumption. 

We next replace each group by a subset of its variables of size m = 1, . . . , n g , 
and apply Algorithm [T] to these subgroups. As expected, the larger m is, the less 
ordering mistakes are made. Details can be found in the online appendix. 

1 We do not compare our results to methods such as PC [I] or GES [3], as they 
cannot distinguish between Markov-equivalent graphs. Hence, in these simulations, 
they cannot provide any conclusions about the ordering among the groups since we 
generate complete graphs over the groups to ensure a total causal order. 
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Finally, we test the strategies described in Section 3.3 for handling low sample 
sizes in high dimensions on 50 models with 3 groups of 100 variables each, using 
200, 500 and 1000 samples, and s = 5%. For L 2 -regularization, we choose the pa- 
rameter A using 10-fold cross validation on the covariance matrix. When taking 
subgroups, we use N = 10 data sets, and each subgroup containing ten variables. 
The error rates are shown in Figure [l] (b) (we only show the L 2 -regularized re- 
sults if they were better than without regularization) . Unreliable estimates of the 
covariance matrix seem to affect especially the Trace Method, and the Pairwise 
Measure on the smaller sample sizes. On the smallest sample, using subsets seems 
to be advantageous for most methods, however, the best performing approach 
is the Naive Pairwise Measure, which, however, does not seem to converge to be 
consistent, where as GroupDirectLiNGAM and the Pairwise Measure are. 

In general, the simulations show that the introduced method often correctly 
identifies the true causal order, and clearly outperforms the simple ad hoc ap- 
proaches. It is left to future work to study the performance in cases of model 
violations as well as to apply the method to real world data. 
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